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Abstract 


Isotonic regression (IR) is a non-parametric calibration method used in super¬ 
vised learning. Eor performing large-scale IR, we propose a primal-dual active-set 
(PDAS) algorithm which, in contrast to the state-of-the-art Pool Adjacent Viola¬ 
tors (PAV) algorithm, is easily warm-started and can be parallelized. This warm¬ 
starting capability makes it well-suited for online settings. We prove that, like the 
PAV algorithm, our PDAS algorithm for IR is convergent and has a work complex¬ 
ity of 0{n), though our numerical experiments suggest that our PDAS algorithm 
is often faster than PAV. In addition, we propose PDAS variants (with safeguarding 
to ensure convergence) for solving related trend filtering (TP) problems, providing 
the results of experiments to illustrate their effectiveness. 


1 Introduction 

Isotonic regression is a non-parametric method for fitting an arbitrary monotone function to a dataset 
ffllll that has recently gained favor as a calibration method for supervised learning El a a a. a 
well-known and efficient method for solving IR problems is the Pool Adjacent Violators (PAV) al¬ 
gorithm IT). This method is easily implemented and enjoys a convergence guarantee with a work 
complexity of 0{n) where n is the dimension of the dataset. A drawback of the PAV algorithm in 
large-scale settings, however, is that it is inherently sequential. Consequently, in order to exploit 
parallelism, one has to resort to decomposing the IR problem [8], where deciding the number of 
processors is nontrivial. Eor example, a recent Spark implementation of a parallelized PAV method 
suffers from significant overhead ||9]. In addition, since the PAV algorithm must be initialized from 
a particular starting point, it cannot be warm-started—a fact that is especially detrimental when a se¬ 
quence of IR problems need to be solved im, such as in online settings where data points are added 
continually. As an alternative, we propose a primal-dual active-set (PDAS) method for solving IR 
problems. Our PDAS algorithm also has a convergence guarantee for IR and a work complexity of 
0{n), but can be warm-started and is easily parallelized. We also provide PDAS algorithm variants 
for a related class of trend filtering (TP) problems. Alternative approaches for certain TP problems 
include interior point methods CD, specialized ADMM methods d, and proximal methods CD, 
but advantages of active-set methods such as ours are that they may terminate finitely and often 
yield very accurate solutions. Eor reasons such as these, they may be favorable for certain appli¬ 
cations such as switch point identification and time series segmentation. The results of numerical 
experiments are provided for all of our algorithm variants to illustrate their practical strengths. 

This paper is organized as follows. In ^ we define and describe the IR and TP problems for which 
our algorithms are designed. In ^ we summarize and compare the well-known Pool Adjacent 
Violators (PAV) algorithm and our proposed primal-dual active-set (PDAS) method for solving IR 
problems. PDAS variants (with and without safeguards) for solving related TP problems are pre- 
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sented in Q We report on our experimental results as well as our findings with other approaches 
in Finally, concluding remarks are provided in 

2 Problem Descriptions 

Isotonic Regression We consider the isotonic regression (IR) problem 

n 

min i ojiiui - Oi)'^ subject to 9i < ... <9^, (IR) 

i=l 

where y G K" represents observed data and w S K” represents weights for the data fitting term. 
The goal of this optimization problem formulation is to determine a monotonically increasing step 
function that matches the observed data as closely as possible in a sense of distance defined by w. 

Trend Filtering Problem ( |lR] i is related to a special case of the trend filtering problem 

min t^(9), where (p(9) = f{9) + Ap(0), (TF) 

/ : K" I—K is smooth and convex, A > 0 is a regularization parameter, and the regularization func¬ 
tion g : K" I—M is convex but not necessarily smooth. In trend filtering or time series segmentation, 
/ is usually chosen to measure the distance between 9 and y while g imposes desired properties on 
the solution 9. A typical trend filtering problem has the form 

n 

/(^) = ~ = \\D9h or g{9) = ||(D6»)+||i, 

i=l 

where D is a first-order (or higher) difference operator and ( 7 )+ = max{ 7 , 0} (component-wise). 
Specifically, as described in ifTTl . a fc-th order difference matrix g is defined 

recursively via the relation For example, the first and second order 

difference matrix and defined, respectively, as 


T -1 


T -2 1 

1 -1 

and = 

1 -2 1 

1 -1_ 


1 -2 1_ 


When g{9) = ||(D^^’")0)_|.||i and A is sufficiently large, solving ( |TF| l gives the solution of ( |lRl l. 

3 Algorithms for Isotonic Regression 

We describe and compare two efficient algorithms for solving problem ( |IR] ); in particular, the PAV 
and our proposed PDAS algorithms are described and their corresponding theoretical properties are 
summarized in p.l| and p.2| r espectively. Throughout this and the subsequent sections, we borrow 
the following notation from 171 in the algorithm descriptions. 

Notation Let J represent a partition of the variable indices {l,2,...,n} into ordered blocks 
{Bi, B 2 , ■ ■ ■} where each block consists of consecutive indices, i.e., each block has the form 
{p,... ,q} for p < q. The immediate predecessor (successor) of block B is denoted as B- {B+). 
By convention, B_ (i?+) equals 0 when B is the initial (final) block. The weighted average of 
the elements of y in block B is denoted Av{B) := {J2i=p'^iyi)/ index 
i G B — {p, ..., g}, we define the “lower” and “upper” sets 

J) = {p,p+l,...,i} and U^{J) = {i + l,i + 2,..., q}. 

Hereinafter, we shall use Li and Ui for brevity when their dependence on a particular J is clear. 
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3.1 PAV for Isotonic Regression 

We describe briefly the PAV algorithm ll7lfn ri4llT5l l^f8l. state its main theoretical properties, and 
discuss its important features in this section. To start with, we rephrase the PAV algorithm from fT], 
where the algorithm is shown to replicate a dual active-set method for quadratic optimization. 


Algorithm 1 PAV for Isotonic Regression 


1 

Input the initial partition J = {n}} and set C = {1} 


2 

loop 


3 

if C+ = 0 then 


4 

Terminate and return 6i = Av(B) for each i £ B for each B £ J 


5 

if Av(C) < Av(C+) then 


6 

Set C ^C+ 


7 

else 


8 

Set J £- {J\{C, C+l) U (C U C+) and C ^ C U C+ 

o Merge C with C+ 

9 

while Av(C-) > Av(C') and C- / 0 do 


10 

Set J ^ (J\{C-, C}) U (C- U C) and C ^ C- U C 

l> Merge C with C- 


The main idea of Algorithm [T] can be understood as follows. Initially, each index is represented by a 
separate block. The algorithm then sequentially visits all blocks, merging a block with its successor 
whenever a “violator” is met, i.e., whenever a block has a weighted average greater than its successor. 
Once any merge occurs, the algorithm searches backwards to perform subsequent merges in order to 
ensure that, at the end of any loop iteration, no violators exist up to the furthest visited block. Once 
all blocks have been visited, no violator exists and the solution 9 will be monotonically increasing. 

An impressive property of Algorithm [T] is that by storing an intermediate value for each block and 
showing that at most n merge operations may occur, one obtains an efficient implementation that 
solves problem ( |lRl l within 0{n) elementary arithmetic operations ifTSlI . Due to this fact and its good 
practical performance. Algorithm has been popular since its invention. However, we argue that 
the PAV algorithm does have critical drawbacks when it comes to solving large-scale problems of 
interest today. First, Algorithmj^must be initialized with J = {{!},..., {n}}, which is unfortunate 
when one has a better initial partition, such as when one is solving a sequence of related instances 
of(l^ Co). In addition, the sequential nature of PAV makes it difficult to leverage multi-processor 
infrastructures. 


3.2 PDAS for Isotonic Regression 

Primal-dual active-set (PDAS) methods have been proposed in the literature for solving Linear Com¬ 
plementarity Problems (LCPs) ifTbll . bound-constrained QPs (BQPs) iflTl . and generally-constrained 
QPs ifTSl . To our knowledge, however, the application and theoretical analysis of a tailored PDAS 
method for solving problem has not previously been studied. 

In this section, we propose a PDAS method designed explicitly for and discuss its theoretical 
guarantees and practical benefits. We first reveal the relationship between problem ( |lRl l and a special 
class of convex BQPs for which a primal-dual active-set (PDAS) method is known to be well-suited. 
We then propose our PDAS algorithm tailored for solving Finally, the complexity of our 

proposed algorithm is analyzed and its key features and properties are discussed. 


( |IR| i and ( |BQP| i Let H = diag(a;i,..., w„) be the diagonal weight matrix. The dual of ( |Ir1 i is then 

min ^z^Dn-^D'^z - y'^D^z, (BQP) 


where 




2 

aJ2 


2 

aJ2 


0 

1 

1^3 


0 —^ ^ 


0 0 



0 

0 

0 


and Dy = 


yi - V2 

2/2 - 2/3 



Vn — l y-n 
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Since DQ,~^D^ is positive definite with non-positive off-diagonal entries, it is an M-matrix, mean¬ 
ing that ( |BQP| i has a form for which a PDAS method is well-suited ifTTll . In descriptions of PDAS 
methods for BQP such as that in IfTTll . the notion of a partition corresponds to a division of the index 
set for z into “active” and “inactive” sets. There is a one-to-one correspondence between a partition 
of ( |BQP| ) and that of ( |lR] l; specifically, the non-zero indices in z correspond to the boundaries of the 
blocks of a partition for problem We have the following result for the present setting. 
Theorem 3.1 (Theorem 3.2, ifTTl f. If the PDAS method from 4771/ is applied to solve problem ( |BQPl i, 
then the iterate sequence {zk} is nondecreasing, has Zk > 0 for all k > 1, and converges to the 
optimal solution of (|BQP|l. 


A PDAS Algorithm for ( |I^ One can apply the PDAS method from ifTTl to solve ( |IR] i by applying 
the approach to its dual ( |BQP|i. However, a straightforward application would fail to exploit the 
special structure of problem ( IR)!. A lgorithm [2] on the other hand, generates the same sequence of 
iterates as the PDAS method of 1171. but is written in a much more computationally efficient form. 


Algorithm 2 PDAS for Isotonic Regression 
1: Input an initial partition Jo 
2: For each i £ B = {p, ■ ■ ■ ,q} € Jo, set 


E.gfl 


and Zi t— 


- Oi) 

0 

Zi-\ -I- uji{yi 


e^) 


ifi = p 
if i = q f n 
otherwise 


3: Initialize Ji t— Jo 

4: for each i £ B £ J\ with Zi < 0, set Ji t— {Ji\B) U {Li, Ui} i> Split B into Li and Ui 

5: for each B £ Ji, set as £- J2ieB Pb Jfi^g Lb £- olbI{Ib 

6: for /c = 1, 2,... do 
7: Initialize Jfe+i •<— Jh 

8: for each [B^,... ,Bt} C Jj. with p.B„_i < Pb, > ■ ■ ■ > pst < PSt+i ^ Merge {Ba ,..., Bt} 


t 

Let A t- [J Bj and update J^+i t- (Jfc+i U N)\{Bs,Bt}, 

j-s 

t t 

Set ajv E CtBj , Pn ■£- ’^2, PBj ’ LN ■£- OiN /Pn 

j=S j=3 


9: if Jfe+i = Jfe, then break 

10: for each B £ J^, set 6i <— ps for alii £ B 


The major difference between Algorithms [T] and is the manner in which blocks are merged. In 
Algorithm [T] each merge involves a single pair of adjacent blocks, whereas Algorithm [2 allows 
simultaneous merge operations, each of which may involve more than two blocks. In the fo lowing 
theorem, we prove that Algorithm like the careful implementation lITSll of Algorithm [T solves 
problem ( |IR] ) in 0{n) elementary arithmetic operations. 

Theorem 3.2. If Algorithm^is applied to solve problem ( |IR| l, then it will yield the optimal solution 
/or® within 0{n) elementary arithmetic operations. 


Proof One can verify that Algorithm is equivalent to applying the PDAS method from IfTTll to 
solve ( |BQP| i, from which it follows by Theore m|3.1| that it finds the optimal solution in hnitely many 
iterations. The initialization process in Steps U|^ requires 0{n) elementary arithmetic operations 
as each step involves at most a constant number of calculations with each value from the dataset. 
As for the main loop involving Steps |6]^ the introduction of a and (3 ensures that the number of 
elementary arithmetic operations in merging two blocks becomes 0{1). Thus, since the for loop 
only involves merge operations and there can be at most n merges, the desired result follows. □ 


3.3 Further Discussion 

Algorithm [^enjoys several nice features that AlgorithmfTland other relevant algorithms do not pos¬ 
sess. For one thing, the initial partition Jq of Algorithmj^can be chosen arbitrarily. This allows the 
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algorithm to be warm-started if one has a good optimal partition estimate. This feature is particularly 
appealing when a sequence of related instances of ( |IR] i need to be solved ifTOll . 

Another interesting feature of Algorithm]^ is its potential to be parallelized. This is possible since 
Algorithm [fallows for multiple independent merge operations in each iteration; see Step 8 of Al¬ 
gorithm ^ As an illustrative example with y = {6,4, 2, 9,11,4} and w = e, we demonstrate in 
Figure [T the different behavior between Algorithm]^ and Algorithm[T]when applied on this data set. 




Figure 1: Illustration of per-iteration merge operations in PAV (left) and PDAS (right). 

As illustrated in Figure[2 each iteration of PAV only merges two consecutive blocks whereas PDAS 
allows multiple consecutive blocks to be merged simultaneously throughout the dataset. Notice also 
that PDAS only requires three division operations while PAV requires four, despite the fact that in 
both methods the number of merge operations are counted as four. 


4 PDAS for Trend Filtering 


The trend filtering problem ( |TF| l can be viewed as a generalization of problem ( |lRl l. While ( |IR| ) 
imposes monotonicity on the solution vector 9, variants of can impose other related properties, 
as illustrated in §4. 1| Consequently, it is natural to extend PDAS for solving ( |TF| i, as we do in (4.2' 


However, since a direct application of a PDAS method may cycle when solving certain versions 
of ( |TF[ ), we propose safeguarding strategies to ensure convergence; see 


4.1 Regularization with Difference Operators 

Common choices for the regularization function in problem ( |TF| i are g{9) = gi{e) := i 

or g{9) = gi+{9) = ||(Zl^‘^’")0) + ||i, where G is the d-order difference matrix 

on K". The choice of the regularization function determines the properties that one imposes on 6. 
We illustrate the typical behavior of 9 for different choices of the regularization in Figure]^ 

As shown in Figure!^ when g = the fitted variable 9 has the property of being nearly-monotone 
and nearly-convex for d = 1 and d = 2, respectively. Similarly, when g — gi, the fitted curve 
would be piecewise constant and piecewise linear for d = 1 and d = 2, respectively. Higher order 
difference operators may be used, though the first and second order ones are more widely used. 


4.2 A PDAS Framework for Trend Filtering 

For brevity, let D := Denote the optimal solution of problem ( |TF| l as {9*, z*). Correspond¬ 

ing to this optimal solution, we may partition the indices of D9* as follows: 

r* = {j : {D9*), > 0}; = {j : iD9*)j < 0}; A* = {j : {D9*), = 0}. 
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Figure 2; Trend filtering solutions for different choices of g and 


A typical PDAS framework consists of three steps, as shown in Algorithm]^ subspace minimization 
(SSM), termination check, and partition update. We now discuss each of these steps in turn. 


Algorithm 3 PDAS Framework 
1: Input an initial partition {V,N, A) 

2: loop 

3: Compute the subspace minimizer (6, z) corresponding to (V,N, A) t> subspace minimization (SSM) 

4: if {6, z) is optimal, then terminate and return {6, z) t> termination check 

5: Compute a new partition {V, N, A) [> partition update 


Subspace Minimization Corresponding to an estimate (V^N,A) of the optimal partition 
{V, A*) there exists a unique primal-dual estimate (0, z) of (0*, z*). Denoting X as the union 
V U Af, the following schematics show the processes for computing {9, z) for problem ®. 


SSM for 3(61) = ||(D6l)+||i 

Set Zj t— 0 for j G Af and Zj <— 1 for j GV. 
Solve for {9, z a)'- 

I ad 5 ' 

XDa 0 

Set 

Vp^ljGV: {De)j < 0}; 
VN^ijGjX-. {De)i > 0}; 

Vap {j GA: Zj > 1}; 

Vajv ^ {j GA: Zj < 0 }. 
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y - XD^zx 


ZA 


0 


SSMfor3(6») = |jD6»||i 

Set Zj < -1 for j G Af and Zj <— 1 for j GV. 

Solve for (6^, za)'- 

I XD'^ 

XDa 0 

Set 

Vp^ijGV-. {De)j < 0 }; 

Vjv G- {j G Af : {D9)j > 0}; 

Vap ^ {j G a '■ Zj > 1}; 

Van ^ {j G a '■ Zj < — 1}. 
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y - XDxZi 


ZA 


0 
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Note that the solution {0,Zj\) of system 0 or (|^ could be efficiently obtained by 

solving for the system = £>^(2/ — A£>Jzx)/A, (3a) 

then setting 9 y — XD^z. (3b) 

When Z3_4 is a first (second) order difference matrix, Dj^D^ is a tridiagonal (quindiagonal) matrix, 
meaning that the left-hand side match in ( |^ is banded and z^i can be obtained cheaply. 

Termination Check One can easily show that a computed pair (0, z) is optimal if the set V = 
Vp U Vtv U Vap U Van, consisting of indices of D9 and z corresponding to violated bounds, is 
empty ||T9|. Thus, when V is empty, optimality has been reached and the algorithm terminates. 
Otherwise, these sets indicate a manner in which the partition could be updated. 

Partition Update In PDAS methods, the indices of D9 and z violating their bounds are deemed 
candidates for being switched from one index set in a partition to another. Specifically, a standard 
update in a PDAS method ini involves the following steps: 

V ^ ir\Vp) U Vap; 

Af ^ {V\Vn) U Van; (4) 

A -P- AI\(Vap U Van) U (Vp U Vat). 


4.3 Safeguarding 

There is no convergence guarantee for Algorithm for an arbitrary instance of (|TF| l; indeed, an 
example illustrating that the method can cycle when solving a BQP is given in 1181 . This example 
is presented here to illustrate that Algorithm]^ can cycle for certain instances of ( |TF| ). 

Example 4. 1. Lety = (603,996,502,19,56,139)^, A = 100, andg{e) = The iterates 

produced in the first few iterations of Algorithm^are shown in Table^ Since the algorithm returns 


Iter 

V 

AT 


DO 

z 

0 

1 

2 

3 

4 

{2,3,4} 

{3} 

{3} 

0 

{2,3,4} 

{1} 

0 

{1,2} 

{1} 

{1} 

0 

{1,2,4} 

{4} 

{2,3,4} 

0 

(13,-689,820,-254)'^ 

(0,0,1111,0)^ 
(-787,520,-16,0)^ 
(-If, 0,0,0)^ 

(13,-689,820,-254)"^ 

(-1,1,1,!)" 

/ 5293 482 5201 \T 

V 2280’ 475’ 5700 A 

(—1 _i 1 

V 100 A 

/ 1 127 371 943\T 

V 125 ’ 125 ’ 500 A 



Table 1: An illustration of Algorithm 3 cycling 

to a previously explored partition without computing an optimal solution, the algorithm cycles, i.e., 
it is not convergent for this problem instance from the given starting point. 

A simple safeguarding strategy to overcome this issue and ensure convergence is proposed in EOll 
and subsequently embedded in the work of ETTl . In particular, when |V| fails to decrease for sev¬ 
eral consecutive iterations, a backup procedure is invoked in which Q is modified to only change 
partition membership of one index of V. We present this approach in Algorithm]^ 

The safeguard of Algorithm|^employs a heuristic to decide whether to update the partition by Q or 
0. However, we have found an alternative strategy that performs better in our experiments. First, 
unlike that of ll^l2Tll . our safeguard changes the memberships of a portion of V, where the portion 
size is dynamically updated. Another difference in the safeguard design is that we employ a finite 
queue (first-in-first-out) to store recent values of |V| of which the maximum serves as the reference 
measure that diminishes as the algorithm continues. When an element is pushed into the queue that 
is full, the earliest element is removed. We present our approach in Algorithm]^ 

Algorithm|^can be understood as follows. If | V| = 0, then (0, z) is optimal and the algorithm termi¬ 
nates. Otherwise, the update Q is to be applied using only the [p| V|J indices from V corresponding 
to the largest violations. Ifp = 1/|V|, then this corresponds to moving only one index as in EOll . but 
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Algorithm 4 PDAS Framework with Safeguarding ll^ 

1: Input {VjAf, A) and an integer tn,ax 
2: Initialize Vbest ^ oo and I; t— 0 

3 : loop 

4: Compute the subspace minimizer (6, z) corresponding to {V,N, A) 

5: if |V| = 0, then terminate and return (9, z) 

6: if IV| < Vbest then 

7: Set t i — 0 and Vbest ^— |V| 

8: Apply l|^ 

9: else if |V| > Vbest then 

10: Set I ■<— i: + 1 

11: if t < tmax then 

12: Apply 

13: else if 1 > t^ax then 

14: Set j ■<— min{i : i £ V} and apply a partition update by 

moving j from V \o A, if ji G Vp 
moving j from A/” to A, if y G Vn 
moving j from A to "P, if ji G Vap 
moving j from A to Af, if j G Van 


Algorithm 5 PDAS Framework with Safeguarding 

1: Input ("P, A/”, A), queue Qm with size m, proportion p G (0,1]> parameter Js G (0,1) and 5e £ (l,oo) 

2: loop 

3: Compute the subspace minimizer (0, z) corresponding to {V,N, A) 

4: if |V| = 0, then terminate and return (6, z) 

5: Set max/min -t— maximum/minimum of Qm 

6: if|V| > max then 

7: setp ■<—max(5sP, 

8: else if I V| < min then 

9: push |V| into Qm and setp t— max{SsP, 1) 

10: else 

11: push |V| into Qm 

12: Sort V by max(A|D0|, \z\) and apply 0, on y changing the top p|V| indices 


if p S (1/|V|, 1], then a higher portion of violated indices may be moved. As long as the reference 
value—i.e., the maximum of the values in the queue—decreases, the value for p is maintained or is 
increased. However, if the reference value fails to decrease, then p is decreased. Overall, since the 
procedure guarantees that the reference value is monotonically decreasing and that p is sufficiently 
reduced whenever a new value for |V| is not below the reference value, our strategy preserves the 
convergence guarantees established in ll20l while yielding better results in our experiments. 

5 Experiments 

We implemented Algorithms and^in Python 2.7, using the Numpy (version 1.8.2) and Scipy 
(version 0.14.0) packages for matrix operations. In the following subsections, we discuss the results 
of numerical experiments for solving randomly generated instances of problems ( |Ir1 i and ( |TF| l. For 
problem ( |lR| l, merge operations for Algorithm]^ were carried out sequentially (but recall Figure[T]in 
which we have illustrated that they could be carried out in parallel). Throughout our experiments, 
we set u) as an all-one vector. 

5.1 Test on Isotonic Regression 

We compare the performance of Algorithm|^(PDAS) with a Python implementation of Algorithm[T] 
(PAV) that was later integrated into scikit-learn (version 0.13.1) l22l- The data yi are generated by 
Pi i + Si where Si ~ JV{0, 4). The initial partition is set as Jq = {{1}, {2},..., {n}} for both 
algorithms. We generated 10 random instances each for n G {1 x 10^, 5 x 10^,..., 33 x 10^}. 
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Boxplots for running time in seconds (Time (s)) and number of merge operations Merge) are 
reported in Figure 
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Figure 3; Comparison of PDAS and PAV in running time (left) and # of merges (right). 


Figure [^demonstrates that our implementation of PDAS outperforms PAV in terms of running time. 
That being said, when both use the set of singletons as the starting point, the numbers of merge 
operations performed by the two algorithms are nearly identical. 

Warm-starting Figure]^ does not show an obvious advantage of PDAS over PAV in terms of the 
numbers of merge operations. However, as claimed, we now show an advantage of PDAS in terms 
of its ability to exploit a good initial partition. We simulate warm-starting PDAS by generating 
an instance of as in our previous experiment, solving it with PDAS, and using the solution as 
the starting point for solving related instances for which the data vector y has been perturbed. In 
particular, for each problem size n, we generated 10 perturbed instances by adding a random variable 
Cj ~ 10“^) to each yi. The results of solving these instances are reported in Figures [^ and [^ 
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Figure 4; Comparison of warm-started PDAS and PAV in running time (left) and # of merges (right). 


Since PAV is not able to utilize a good initial partition, the required work (i.e., number of merge 
operations) for solving the perturbed problems is not cheaper than for the base instance. In contrast, 
warm-starting the PDAS algorithm can greatly reduce the computational cost as observed in the 
much-reduced number of merge operations (even after accounting for the added split operations). 

5.2 Test on Trend Filtering 

We now compare the performance of several PDAS variants for trend filtering. In particular, we 
compare the straightforward PDAS method of Algorithm]^ (PDAS), Algorithm [^(SFl), and the 
PDAS method with our proposed safeguard strategy in Algorithm|^(SF2). The safeguard parameter 
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Figure 5: Summary of warm-started PDAS in terms of # of splits. 


tmax = 5 was chosen for SFl and we similarly set m = 5, Ss = 0.9, and Sg = 1.1 for SF2. We 
generated 10 random problem instances each for n € {10"^, 1.7 x 10®, 3.3 x 10®} where, for each 
instance, the data vector had t/i uniformly distributed in [0,10]. Such datasets had minimum pattern 
and thus made each pro blem relatively difficult to solve. We considered both regularization functions 
Pi and defined in 14.1 with difference matrices and setting A = 10 in all cases. 

For all runs, we set an iteration limit of 800; if an algorithm failed to produce the optimal solution 
within this limit, then the run was considered a failure. The percentages of successful runs for each 
algorithm is reported in Table 


Table 2: Percentages of successful runs for each algorithm and problem type 


n (size) 

% of success 

9(^) = 
PDAS 

SFl 

SF2 

= 

PDAS 

\\D^^ 

SFl 

SF2 

PDAS 

SFl 

SF2 

9(0) = 
PDAS 

SFl 

SF2 

1.Oe+4 

1.0 

1.0 

1.0 

1.0 

1.0 

1.0 

0.0 
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We observe from Table that all algorithms solved all instances when the regularization function 
involved a first-order difference matrix, but that PDAS and SFl both had failures when a second- 
order difference matrix is used. By contrast, our proposed safeguard in SF2 results in a method 
that is able to solve all instances within the iteration limit. This shows that our proposed safeguard, 
which allows more aggressive updates, can be more effective than a conservative safeguard. 

To compare further the performance of the algorithms, we collected the running time and iteration 
number for all successfully solved instances. Figure demonstrates that when D = all 

algorithms show very similar performance. However, when D = pjgm-e|^ the results 

show that SF2 is not only more reliable than PDAS and SFl; it is also more efficient even when 
SFl is successful. We also include the results for the interior-point method (1PM) proposed in im, 
but emphasize that this algorithm is implemented in Matlab (as opposed to Python) and is only set 
up to solve the instances when an ^i-regularization function is used. Although the interior point 
method demonstrates impressive performance, we remark that in general it is difficult to warm- 
starting interior point methods ll23l . despite recent efforts toward this direction Il24ll25ll2^ . 

Warm-starting 

We conclude our experiments by comparing the performance of I PM —which is not set up for warm- 
starting—and warm-started SF2. As in |5.1| we generated 10 perturbed instances for a given dataset 
by adding a random variable ~ A7(0,10“^) to yi. The running times and numbers of iterations 
for solving the perturbed problems are reported via boxplots in Figure]^ 

Comparing the performance of SF2 between Figures [6] [7} a nd [8| shows that warm-starting SF2 
can dramatically reduce the cost of solving an instance of (|TF|l. With a cold-start, SF2 may require 
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Figure 6: PDAS vs IPM for D = and different choices of g. 
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Figure 7: PDAS vs IPM for D = and different choices of g. 


hundreds of iterations, while with warm-starting it requires dramatically fewer iterations. In contrast, 
IPM does not benefit much from a good starting point. 


11 





8 


D SF2 

□ ipM 


35 


□ SF2 

□ IPM 


7 


6 

5 


(D 4 



2 


1 

0 


10k 


170k 

Size 


12 

10 



10k 170k 

Size 




330k 


30 
° 25 

4-) 

CO 20 
<V 

-M 15 

M 

* 10 
5 
0 


(a) 73 = 73(1’"), 3 = 


■ SF2 
□ IPM 




35 


g 30 


g 20 

M 15 
4fc 10 


— 

— 

— 

IGk 

= 51 

170k 

Size 

330k 

T 






5 

,—1 

t~~l 

330k 

® 10k 

170k 

330k 

(b) D = 

II 

Size 



° SF2 

n IPM 


Figure 8; PDAS (with warm-start) vs IPM. 


6 Concluding Remarks 

We propose innovative PDAS algorithms for Isotonic Regression (IR) and Trend Filtering (TF). 
For IR, our PDAS method enjoys the same theoretical properties as the well-known PAV method, 
but also has the ability to be warm-started, can exploit parallelism, and outperforms PAV in our 
experiments. Our proposed safeguarding strategy for a PDAS method for TF also exhibits reliable 
and efficient behavior. Overall, our proposed methods show that PDAS frameworks are powerful 
when solving a broad class of regularization problems. 
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